Reproducible Large-Scale Neuroimaging Studies with the OpenMOLE Workflow Management System

Authors

  • Jonathan Passerat-Palmbach
  • Romain Reuillon
  • Mathieu Leclaire
  • Antonios Makropoulos
  • Emma C. Robinson
  • Sarah Parisot
  • Daniel Rueckert
Abstract

OpenMOLE is a scientific workflow engine with a strong emphasis on workload distribution. Workflows are designed using a high-level Domain-Specific Language (DSL) built on top of Scala. It exposes natural parallelism constructs that delegate the workload resulting from a workflow to a wide range of distributed computing environments. The DSL hides the complexity of designing such experiments: users can embed their own applications and scale their pipelines from a small prototype running on a desktop computer to a large-scale study harnessing distributed computing infrastructures, simply by changing a single line in the pipeline definition. The construction of the pipeline itself is decoupled from the execution context, and the high-level DSL abstracts the underlying execution environment, unlike classic shell-script-based pipelines. These two aspects allow pipelines to be shared and studies to be replicated across different computing environments. Workflows can be run as traditional batch pipelines or coupled with OpenMOLE's advanced exploration methods to study the behavior of an application or to perform automatic parameter tuning. In this work, we briefly present the strong assets of OpenMOLE and detail recent improvements targeting the re-executability of workflows across Linux platforms. We have tightly coupled OpenMOLE with CARE, a standalone containerization solution that allows an application packaged on one Linux host to be re-executed on any other Linux host. The solution is evaluated against a Python-based pipeline involving packages such as scikit-learn as well as binary dependencies. All were packaged and re-executed successfully on various HPC environments, with identical numerical results (here, prediction scores) obtained in each environment. Our results show that the pair formed by OpenMOLE and CARE is a reliable way to produce reproducible results and re-executable pipelines. A demonstration of the flexibility of our solution showcases three neuroimaging pipelines harnessing distributed computing environments as heterogeneous as local clusters and the European Grid Infrastructure (EGI).
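To make the "single line" claim concrete, the sketch below shows how such a workflow might look in the 2017-era OpenMOLE Scala DSL. It is illustrative only: the archive name, script, host names, and subject range are hypothetical, and the task and environment constructors (CARETask, LocalEnvironment, SLURMEnvironment, EGIEnvironment) follow the project's documentation of that period, so exact signatures may differ in other releases.

    // Typed variable exchanged between workflow tasks.
    val subject = Val[Int]

    // Wrap the CARE-packaged Python pipeline. The archive
    // ("pipeline.tgz.bin" is a placeholder name) would have been
    // produced beforehand by running the pipeline once under CARE
    // on the packaging host.
    val predict = CARETask("pipeline.tgz.bin", "python predict.py ${subject}") set (
      inputs += subject
    )

    // Candidate execution environments; the delegation target is
    // chosen on the last line of the script.
    val local   = LocalEnvironment(4)
    val cluster = SLURMEnvironment("login", "cluster.example.org")
    val grid    = EGIEnvironment("biomed")

    // Run the packaged pipeline once per subject on the cluster.
    val exploration = ExplorationTask(subject in (1 to 100))

    exploration -< (predict on cluster)

Replacing "predict on cluster" with "predict on grid" (or "predict on local") is the one-line change the abstract refers to; because every run unpacks the same CARE archive, the binaries and Python dependencies are identical wherever the task lands.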


Similar articles

OpenMOLE, a workflow engine specifically tailored for the distributed exploration of simulation models

Complex systems exhibit multiple levels of collective structure and organization. In such systems, the emergence of global behaviour from local interactions is generally studied through large-scale experiments on numerical models. This analysis generates heavy computational loads, which require the use of multi-core servers, clusters, or grid computing. Dealing with such large-scale executions...


Towards Measuring the Project Management Process During Large Scale Software System Implementation Phase

Project management is an important factor in successfully implementing large-scale software systems (LSS). Effective project management comes into play to plan, coordinate, and control such a complex project. The project management factor has been argued to be one of the important Critical Success Factors (CSF), which need to be measured and monitored carefully duri...


Bio-Swarm-Pipeline: A Light-Weight, Extensible Batch Processing System for Efficient Biomedical Data Processing

A streamlined scientific workflow system that can track the details of the data processing history is critical for the efficient handling of fundamental routines used in scientific research. In the scientific workflow research community, the information that describes the details of data processing history is referred to as "provenance", which plays an important role in most of the existing work...


Enabling scalable scientific workflow management in the Cloud

Cloud computing is gaining tremendous momentum in both academia and industry. In this context, we define the term “Cloud Workflow” as the specification, execution and provenance tracking of large-scale scientific workflows, as well as the management of data and computing resources to support the execution of large-scale scientific workflows in the Cloud. In this paper, we first analyze the gap ...


CORBA based Architecture for Large Scale Workflow

Standard client-server workflow management systems have an intrinsic scalability limitation: the central server is a bottleneck for large scale applications. It is also a single fault point that may disable the whole system. We propose a fully distributed architecture for workflow management systems. It is based on the idea that the case (an instance of the process) migrates from host to host, ...



Journal:

Volume 11, Issue

Pages -

Publication date: 2017